The project

In this project, you will use R and apply exploratory data analysis techniques to examine relationships among one or more variables and to explore a specific dataset for distributions, outliers, and anomalies.

Exploratory Data Analysis (EDA) is the numerical and visual examination of the characteristics of data and their relationships, using formal methods and statistical strategies.

EDA can yield insights, which may lead to new questions and eventually to predictive models. It is an important "line of defense" against bad data and an opportunity to check whether your assumptions or intuitions about a dataset are being violated.

Introduction

This analysis explores a red wine dataset [Cortez et al., 2009], originally built to model wine quality as reflected by the chemical properties of each wine. A friend with a chemistry degree helped me identify chemical properties that can give wine an unpleasant taste, and those hypotheses guide my analysis.

Univariate Plots Section

Overview

To begin, we look at each variable separately to get a sense of what we are dealing with:

##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

The data are well formatted, and although some columns appear to contain outliers, nothing looks out of the ordinary.

First we remove the index column (X), which is not needed.
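A minimal sketch of dropping the index column in base R. The tiny `wine` data frame below is a stand-in built from the first rows printed above, not the full dataset, and the report's actual code is not shown, so this is only one way it could be done.

```r
# Toy stand-in for the wine data (first three rows of the str() output above).
wine <- data.frame(
  X = 1:3,
  fixed.acidity = c(7.4, 7.8, 7.8),
  quality = c(5L, 5L, 5L)
)

# Assigning NULL to a column removes it from the data frame.
wine$X <- NULL

names(wine)
```

Assigning `NULL` modifies the data frame in place; `wine[, names(wine) != "X"]` would be an equivalent alternative.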

Quality

We start with the quality variable:

Although possible scores range from 0 to 10, the data only contain scores from 3 to 8, with a peak at 5 and few examples at the extremes. Let's look in more detail:

Worse wines:

## [1] 63

Better wines:

## [1] 217

Only 18 wines received the highest score from the judges, and low-quality wines are also poorly represented. We will return to this analysis later.
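The two counts above can be obtained by thresholding the quality score. The sketch below uses a made-up `quality` vector, and the cut-offs (4 or lower for "worse", 7 or higher for "better") are an assumption, since the report does not state which thresholds it used.

```r
# Hypothetical scores; the real column has 1599 values.
quality <- c(3L, 4L, 5L, 5L, 6L, 6L, 7L, 8L)

# Assumed cut-offs: "worse" means quality <= 4, "better" means quality >= 7.
n_worse  <- sum(quality <= 4)
n_better <- sum(quality >= 7)

# Frequency of each score, useful for spotting thinly populated grades.
table(quality)
```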

Alcohol

Now we analyze alcohol content.

The most common alcohol content is around 9.4, and the distribution is quite irregular (possibly bimodal). It may be worth creating subsets for the different wine qualities to analyze it better.

The picture is not entirely clear because of the small sample of good wines, but better wines appear to have more alcohol than poor ones. My guess is that better wines accumulate more alcohol because of their longer fermentation, but a regression analysis is needed to support this claim with more confidence.

Residual sugar

Now we analyze the residual sugar our wines contain.

Since the distribution is heavy-tailed, we should increase the precision of the x axis and the number of bins to see it better.

There is a peak around 2; let's examine that region.

In this interval the data look approximately normally distributed, and this is where most of the wines are found. The other regions may hold outliers with respect to wine quality, since very sweet wines tend to be considered poor.

Now let's return to the heavy-tailed distribution. To handle it, we rescale the axis with a logarithmic scale.
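To illustrate why a log scale helps with a heavy right tail, here is a small sketch on synthetic data. The `residual_sugar` vector is simulated (log-normal, with a median near the 2.2 reported above) and the `skewness` helper is defined here for illustration; neither comes from the report.

```r
set.seed(42)

# Synthetic, right-skewed stand-in for residual sugar (assumption: log-normal).
residual_sugar <- rlnorm(500, meanlog = log(2.2), sdlog = 0.4)

# Sample skewness: third central moment over the 1.5 power of the variance.
skewness <- function(x) {
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Histogram counts on the raw and on the log10 scale (no plotting needed).
h_raw <- hist(residual_sugar, breaks = 30, plot = FALSE)
h_log <- hist(log10(residual_sugar), breaks = 30, plot = FALSE)

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))
```

Calling `plot(h_raw)` and `plot(h_log)` draws the two histograms; on the log scale the long right tail is compressed and the bulk of the data is easier to read.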

Much better; now we can see a small peak for values above 10.

Now let's look at residual sugar in the outlier wines (the best- and worst-rated):

The modes are at 2, but the poor wines have outliers (around 13) many standard deviations from the mean, and the distributions are heavy-tailed.

Chlorides

Chlorides indicate the salinity of a wine; in excess they spoil it.

This distribution is also heavy-tailed, so we apply the log transformation.

As can be seen, there is a large concentration between 0.07 and 0.09, with outliers to the left and right.

Let's see how they fare:

The low-salinity wines received high scores, which is interesting.

pH

Next we look at pH, which describes the acidity or basicity of the wine on a scale from 0 to 14.

Here we see a well-centered, normal-looking distribution. Let's look at its relationship with wine quality.

No significant difference between the wines is visible.

Density

Density depends on the amount of alcohol and residual sugar. Let's see what this distribution looks like.

Nothing out of the ordinary here, but let's see how it relates to quality.

There is no significant separation between the distributions.

Citric acid

One of the main contributors to a wine's flavor, and perhaps the most interesting variable in the data.

The distribution is very odd, and it is not clear how best to analyze it, but as expected this characteristic differs markedly between wines. Let's take a closer look between the peaks:

Now let's look at the concentration for good and poor wines separately:

For the poor wines the distribution is heavy-tailed, sparse, and centered to the left, whereas for the good wines it is possibly bimodal.

Sulphates

Sulphates are added to wine to control aspects of production and do not interfere much with the final product.

The distribution is heavy-tailed again, so we apply a log transformation.

Now the histogram is more centered, with several peaks and some outliers. I believe the peaks come from rounding, since the values span a small interval.

Again, we compare good and poor wines.

The poor wines have outliers with very high values, which may contribute to their very low scores.

Fixed and volatile acidity

Here we analyze acidity, starting with fixed acidity. Volatile acidity, covered afterwards, can give wine a vinegar taste when in excess.

This distribution has several outliers with very high values. I suspect those wines received poor scores, which is worth checking:

The plot shows that this is not a determining factor in wine quality, since the distributions cover the same interval.

Now for volatile acidity:

The distribution is quite irregular, which I again attribute to truncation. Next, the distributions for good and poor wines.

The distribution for good wines appears shifted to the left and less spread out.

Sulfur dioxide

Let's begin the SO2 analysis with free sulfur dioxide:

Almost all values are below 60, so let's zoom in.

The distribution peaks near 7.

Now comparing good and poor wines:

The poor wines show a wider distribution, but the good wines fall within that same interval.

Let's create the variable bound, which is total sulfur dioxide minus free sulfur dioxide:

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   12.00   21.00   30.59   39.00  251.50
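A short sketch of how such a derived column can be built. The four rows are the first values of `free.sulfur.dioxide` and `total.sulfur.dioxide` printed in the `str()` output earlier; the column name `bound.sulfur.dioxide` is a hypothetical choice, since the report only calls the variable "bound".

```r
# First four rows of the two SO2 columns, as printed in the str() output.
wine <- data.frame(
  free.sulfur.dioxide  = c(11, 25, 15, 17),
  total.sulfur.dioxide = c(34, 67, 54, 60)
)

# Bound SO2 = total SO2 - free SO2 (column name is an assumption).
wine$bound.sulfur.dioxide <-
  wine$total.sulfur.dioxide - wine$free.sulfur.dioxide

summary(wine$bound.sulfur.dioxide)
```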

Now we look at some histograms to see how it behaves.

There are some outliers; let's take a closer look.

These turn out to be good-quality wines.

Comparing good and poor wines for this variable:

Good wines have a much higher peak, and larger outliers as well.

Now analyzing total sulfur dioxide:

This histogram shows 2 outlying points; let's take a look at them.

They are the same good-quality wines we found for bound sulfur dioxide.

Now the comparison of the distributions for good and poor wines.

The poor wines histogram peaks at 109 and then at 189, whereas the excellent wines histogram shows two distinct peaks situated fairly close to each other - at 99 and 119. Also, poor wine samples are more spread out across the X axis, and the poor wines distribution seems to have a left tail.

Univariate Analysis

What is the structure of the dataset?

The dataset has 1599 records with 11 chemical variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, and alcohol), plus wine quality (from 0 to 10) as rated by professionals in the field.

What are the main attributes of interest in this dataset?

The attribute of interest is wine quality, since the dataset was built for statistical analysis of which factors influence wine quality.

Which other attributes do you think can help you investigate these attributes of interest?

From the analysis so far, most factors contribute to wine quality, but pH, sulfur dioxide, and chlorides seemed the most interesting to me.

Did you create new variables from the existing attributes in the dataset?

Yes. In the sulfur dioxide section I created the variable bound sulfur, which is total sulfur dioxide minus free sulfur dioxide; it represents the sulfur dioxide bound to other molecules in the wine.

Of the attributes investigated, were any unusual distributions found? Did you perform operations on the data to clean it, adjust it, or change its shape? If so, why?

Several distributions had outliers or heavy tails. I examined the outliers in separate plots, and I applied a logarithmic transformation to the heavy-tailed distributions, which made them closer to normal and easier to analyze.

Bivariate Plots Section

In this section we analyze the pairwise relationships between the features.

Only alcohol shows a notable correlation with wine quality, which at first discourages a deeper analysis. However, there are relationships involving more variables that remain hidden for now, as well as transformations that could make the relationships linear.

Looking at the pairwise relationships, some variables are clearly related: density and fixed acidity, pH and fixed acidity, and bound and total sulfur dioxide. Other pairs, not listed here, are more weakly related, with both positive and negative correlations.
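Pairwise relationships like these are usually read off a correlation matrix. The sketch below computes Pearson correlations with base R's `cor()` on the ten rows printed in the `str()` output earlier. Ten rows are far too few to reproduce the report's figures, so treat the numbers as illustrative only; the name `wine_head` is made up for this example.

```r
# First ten rows of four columns, copied from the str() output above.
wine_head <- data.frame(
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  pH            = c(3.51, 3.2, 3.26, 3.16, 3.51, 3.51, 3.3, 3.39, 3.36, 3.35),
  alcohol       = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality       = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Pearson correlation between every pair of columns.
cor_matrix <- cor(wine_head, method = "pearson")

round(cor_matrix, 2)
```

Even in this tiny sample the pH and fixed acidity entry is negative, in line with the inverse relationship described in this section.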

Scatter plots of the positive correlations

This pair has the strongest positive correlation.

In these other two plots the data are widely scattered, with no clear non-linear relationship.

Here we see a positive correlation: the higher the alcohol content, the more likely the wine is to have a higher score.

These two pairs show little correlation between the variables, with the distributions concentrated near the origin.

Scatter plots of the negative relationships

Here the correlation indicates that denser wines have less alcohol and less dense wines have more.

Here we also see a weak correlation between alcohol and residual sugar.

Here there are signs of a not very strong negative correlation between bound sulfur dioxide and alcohol content.

Here we have two strong inverse correlations: pH and fixed acidity, and alcohol and chlorides.

Box plots by quality

Here we investigate how each chemical property is distributed across scores:

Quality and fixed acidity

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   6.700   7.150   7.500   8.360   9.875  11.600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.600   6.800   7.500   7.779   8.400  12.500 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.100   7.800   8.167   8.900  15.900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.700   7.000   7.900   8.347   9.400  14.300 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   4.900   7.400   8.800   8.872  10.100  15.600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.000   7.250   8.250   8.567  10.225  12.600
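The per-grade summaries in this section have the layout printed by base R's `by()`. As a sketch, here is how such a summary could be produced, using only the ten rows shown in the `str()` output earlier, so only grades 5, 6 and 7 appear. The `medians` vector is an extra added here for illustration.

```r
# First ten rows of quality and fixed acidity from the str() output.
wine <- data.frame(
  quality       = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L),
  fixed.acidity = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5)
)

# Six-number summary of fixed acidity within each quality grade.
by_quality <- by(wine$fixed.acidity, wine$quality, summary)
by_quality

# The medians alone, as a named vector, are what a box plot's center line shows.
medians <- sapply(split(wine$fixed.acidity, wine$quality), median)
medians
```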

Quality and volatile acidity

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4400  0.6475  0.8450  0.8845  1.0100  1.5800 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.230   0.530   0.670   0.694   0.870   1.130 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.180   0.460   0.580   0.577   0.670   1.330 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1600  0.3800  0.4900  0.4975  0.6000  1.0400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3000  0.3700  0.4039  0.4850  0.9150 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2600  0.3350  0.3700  0.4233  0.4725  0.8500

Quality and citric acid

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0050  0.0350  0.1710  0.3275  0.6600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0300  0.0900  0.1742  0.2700  1.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2300  0.2437  0.3600  0.7900 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.0900  0.2600  0.2738  0.4300  0.7800 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3050  0.4000  0.3752  0.4900  0.7600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0300  0.3025  0.4200  0.3911  0.5300  0.7200

Quality and residual sugar

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.875   2.100   2.635   3.100   5.700 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.300   1.900   2.100   2.694   2.800  12.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   1.900   2.200   2.529   2.600  15.500 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.477   2.500  15.400 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.200   2.000   2.300   2.721   2.750   8.900 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.400   1.800   2.100   2.578   2.600   6.400

Quality and chlorides

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0610  0.0790  0.0905  0.1225  0.1430  0.2670 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04500 0.06700 0.08000 0.09068 0.08900 0.61000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03900 0.07400 0.08100 0.09274 0.09400 0.61100 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.03400 0.06825 0.07800 0.08496 0.08800 0.41500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.06200 0.07300 0.07659 0.08700 0.35800 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.04400 0.06200 0.07050 0.06844 0.07550 0.08600

Quality and free sulfur dioxide

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     3.0     5.0     6.0    11.0    14.5    34.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   12.26   15.00   41.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    9.00   15.00   16.98   23.00   68.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    8.00   14.00   15.71   21.00   72.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00   11.00   14.05   18.00   54.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    6.00    7.50   13.28   16.50   42.00

Quality and bound sulfur dioxide

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    6.75   11.00   13.90   13.75   37.00 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00    8.00   14.00   23.98   32.00  107.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   14.00   29.00   39.53   58.00  128.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    3.00   11.00   19.00   25.16   33.00  126.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.00    8.50   15.00   20.97   21.50  251.50 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00    9.25   11.00   20.17   22.75   76.00

Quality and total sulfur dioxide

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0    12.5    15.0    24.9    42.5    49.0 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   14.00   26.00   36.25   49.00  119.00 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   26.00   47.00   56.51   84.00  155.00 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   23.00   35.00   40.87   54.00  165.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   17.50   27.00   35.02   43.00  289.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12.00   16.00   21.50   33.44   43.00   88.00

Quality and density

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9947  0.9961  0.9976  0.9975  0.9988  1.0008 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9934  0.9957  0.9965  0.9965  0.9974  1.0010 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9926  0.9962  0.9970  0.9971  0.9979  1.0031 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9954  0.9966  0.9966  0.9979  1.0037 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9906  0.9948  0.9958  0.9961  0.9974  1.0032 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9908  0.9942  0.9949  0.9952  0.9972  0.9988

Quality and pH

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.160   3.312   3.390   3.398   3.495   3.630 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.300   3.370   3.382   3.500   3.900 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.200   3.300   3.305   3.400   3.740 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.860   3.220   3.320   3.318   3.410   4.010 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.920   3.200   3.280   3.291   3.380   3.780 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.880   3.163   3.230   3.267   3.350   3.720

Quality and sulphates

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5125  0.5450  0.5700  0.6150  0.8600 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.4900  0.5600  0.5964  0.6000  2.0000 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.370   0.530   0.580   0.621   0.660   1.980 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.4000  0.5800  0.6400  0.6753  0.7500  1.9500 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3900  0.6500  0.7400  0.7413  0.8300  1.3600 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6300  0.6900  0.7400  0.7678  0.8200  1.1000

Quality and alcohol

## wine$quality: 3
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   8.400   9.725   9.925   9.955  10.575  11.000 
## -------------------------------------------------------- 
## wine$quality: 4
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.00    9.60   10.00   10.27   11.00   13.10 
## -------------------------------------------------------- 
## wine$quality: 5
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     8.5     9.4     9.7     9.9    10.2    14.9 
## -------------------------------------------------------- 
## wine$quality: 6
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.80   10.50   10.63   11.30   14.00 
## -------------------------------------------------------- 
## wine$quality: 7
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.20   10.80   11.50   11.47   12.10   14.00 
## -------------------------------------------------------- 
## wine$quality: 8
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    9.80   11.32   12.15   12.09   12.88   14.00

Density plots

Now we plot the density of each variable, split by quality.

In all these plots, we can clearly see a bimodal distribution for the best wines. I guess this effect is due to there being very few wine samples with the highest grade in the dataset, and those samples take on just a few values. Our analysis might have benefited from a greater number of highest-quality wines, as we could have checked whether this pronounced bimodality comes from insufficient data or whether some other factors are at play.

As for the last density plot for residual sugar, the distributions seem quite skewed - and indeed, in the first section of this analysis, we’ve found out that the residual sugar distribution has a very heavy right tail. Let’s now try rebuilding the same density plot, but with the residual sugar variable log-transformed.
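A rough sketch of the log-transformed density estimate per grade, using base R's `density()`. The data are simulated (log-normal residual sugar, three grades of 100 samples each), so the resulting shapes say nothing about the real wines; only the mechanics are shown.

```r
set.seed(1)

# Synthetic stand-in: three grades with right-skewed residual sugar.
wine <- data.frame(
  quality        = rep(c(5L, 6L, 7L), each = 100),
  residual.sugar = rlnorm(300, meanlog = log(2.2), sdlog = 0.5)
)

# Log-transform first, then split the values by quality grade.
log_sugar_by_grade <- split(log10(wine$residual.sugar), wine$quality)

# One kernel density estimate per grade.
densities <- lapply(log_sugar_by_grade, density)

names(densities)
```

To draw them, call `plot(densities[["5"]])` and overlay the other grades with `lines(densities[["6"]], col = 2)` and so on.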

Now it becomes obvious this distribution is actually bimodal across all wine grades! A pretty curious finding that I can't explain right away for lack of domain knowledge. It might even be a phenomenon peculiar to Portuguese wines - it's really hard to tell without having more data handy.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

The strongest positive correlation involving quality is quality vs alcohol (0.436). One particularly interesting thing here is that an upward trend (quality increases as alcohol content grows) holds true only for higher-quality wines, starting from the grade of 6; below this point the trend is actually downward: for wine samples graded 3-5, the lower the alcohol level, the better the wine. The median alcohol value of less than 12 indicates that a wine sample’s maximum score is 7, which might help us tell a good wine sample from a poor one.

The most pronounced negative correlation that has to do with our main feature is observed in the pair quality - density (-0.307). The general trend there is a downward one: with each grade, median density decreases a bit, with the notable exception of one group - wine samples of grade 5, which break this trend and actually have the greatest median density of all grades. Exactly the same picture can be seen in quality vs bound SO2 (-0.218): grade 5 wines once again break the generally downward trend.

Another interesting pattern was discovered in the pair quality vs volatile acidity (-0.195): median values there seemed to change in a wave-like fashion from one grade to another, going up and down a few times.

One more curious finding was that the residual sugar distribution, which is highly skewed initially, when log-transformed and color-coded by quality, is actually bimodal across all the wine grades, from lowest to highest. As I said above, under the relevant plot, I might be lacking some specialist knowledge to draw the right conclusion based on this fact, or it might just be a peculiarity of Portuguese wines.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Fun fact: positive correlations were dominated by density (3 occurrences out of 6), negative ones by alcohol, featured even more prominently (5 occurrences out of 6). Therefore, it’s only natural that these two features produced the most highly correlated pairs (which I’m talking about in more detail in the subsection below), and density had a part in both of them!

Among other things, total SO2 and bound SO2 turned out to be positively correlated with both density and residual sugar. As for negative correlations, the strongest relationships were observed in pairs such as: total SO2 and free SO2 vs alcohol; pH vs fixed acidity; alcohol vs residual sugar and chlorides.

What was the strongest relationship you found?

Surprisingly enough, the two most pronounced correlations didn’t involve the main variable, quality, but instead featured density, which seems to be heavily dependent on both residual sugar and alcohol content. In the former case, the correlation is positive and equals 0.839; in the latter case, the features are negatively correlated (-0.78).

Multivariate Plots Section

In the previous section, we used box plots to see how different variables are distributed across wine grades and scatter plots to discover interesting pairwise relationships between the features. This section allows us to take our analysis one step further by combining the two techniques and examining what relationships the features display (and how these relationships vary) across wine grades.

Scatter plots faceted by quality

Let’s first take a look at a couple of scatter plots for the features that exhibited the strongest correlation, faceted by quality.

Looks like no surprises here. Scatter plots demonstrate the same trends across all wine quality grades: upward for density vs residual sugar and downward for density vs alcohol.

I wonder what plots would look like for less correlated features.

For the lowest-quality wines, alcohol doesn’t seem to be correlated with residual sugar at all, with a negative trend becoming more noticeable towards higher wine grades.

A somewhat similar picture here. In the case of the worst and best wines, alcohol and total SO2 are much less correlated (if correlated at all) compared with wine samples of other grades, which all display a more prominent downward trend.

This time the weakest correlation between the features takes place with the best wine samples. In all other cases, an upward trend is obvious.
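The visual comparison across facets can be backed by a number: the correlation within each grade. The sketch below does this on simulated data, where density is generated as a decreasing function of alcohol plus noise, so the negative sign is built in by construction; nothing here reproduces the report's values.

```r
set.seed(7)
n <- 300

# Synthetic wines: random grade, alcohol in the range seen in the summary.
wine <- data.frame(
  quality = sample(3:8, n, replace = TRUE),
  alcohol = runif(n, 8.4, 14.9)
)

# Assumption for illustration: density falls linearly with alcohol, plus noise.
wine$density <- 1.02 - 0.002 * wine$alcohol + rnorm(n, sd = 0.001)

# Split rows by grade, then compute the alcohol-density correlation per group.
groups <- split(wine, wine$quality)
cor_by_grade <- sapply(groups, function(g) cor(g$alcohol, g$density))

round(cor_by_grade, 2)
```

On real data, grades with very few samples give unstable correlations, so a weak value for the best or worst wines may partly reflect sample size.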

Building a simple linear model

We’ll now build a pretty straightforward linear model to see how well it can predict wine quality based on the features we’ve analyzed.
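As a sketch of the incremental approach, the snippet below fits two nested models with `lm()` and extends the first with `update()`. The data are simulated with made-up coefficients, and quality is continuous here rather than an integer grade, so the fitted numbers are not comparable to the table that follows.

```r
set.seed(123)
n <- 200

# Synthetic predictors drawn from the ranges shown in the summary output.
wine <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)

# Assumed data-generating process, for illustration only.
wine$quality <- 2 + 0.35 * wine$alcohol -
  1.2 * wine$volatile.acidity + rnorm(n, sd = 0.7)

# Start with one predictor, then add another to the same formula.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + volatile.acidity)

r2_m1 <- summary(m1)$r.squared
r2_m2 <- summary(m2)$r.squared

# F-test for whether the extra predictor improves the fit.
anova(m1, m2)
```

R-squared never decreases when a predictor is added to a nested least-squares model, so `anova()` or adjusted R-squared is a better guide to whether each added term is worth keeping.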

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = wine)
## m2: lm(formula = quality ~ alcohol + residual.sugar, data = wine)
## m3: lm(formula = quality ~ alcohol + residual.sugar + density, data = wine)
## m4: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity, 
##     data = wine)
## m5: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity + 
##     pH, data = wine)
## m6: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity + 
##     pH + sulphates, data = wine)
## m7: lm(formula = quality ~ alcohol + residual.sugar + density + volatile.acidity + 
##     pH + sulphates + free.sulfur.dioxide, data = wine)
## 
## =========================================================================================================================
##                             m1            m2            m3            m4            m5            m6            m7       
## -------------------------------------------------------------------------------------------------------------------------
##   (Intercept)              1.875***      1.882***    -42.884***    -24.273*      -13.811        -0.150         2.280     
##                           (0.175)       (0.176)      (12.051)      (11.433)      (11.858)      (11.944)      (12.107)    
##   alcohol                  0.361***      0.361***      0.401***      0.339***      0.346***      0.325***      0.320***  
##                           (0.017)       (0.017)       (0.020)       (0.019)       (0.019)       (0.019)       (0.020)    
##   residual.sugar                        -0.004        -0.026        -0.016        -0.015        -0.007        -0.003     
##                                         (0.013)       (0.014)       (0.013)       (0.013)       (0.013)       (0.013)    
##   density                                             44.547***     27.216*       17.881         3.630         1.209     
##                                                      (11.990)      (11.367)      (11.702)      (11.812)      (11.975)    
##   volatile.acidity                                                  -1.359***     -1.272***     -1.154***     -1.160***  
##                                                                     (0.096)       (0.099)       (0.100)       (0.100)    
##   pH                                                                              -0.383**      -0.303*       -0.290*    
##                                                                                   (0.119)       (0.119)       (0.119)    
##   sulphates                                                                                      0.628***      0.642***  
##                                                                                                 (0.104)       (0.105)    
##   free.sulfur.dioxide                                                                                         -0.002     
##                                                                                                               (0.002)    
## -------------------------------------------------------------------------------------------------------------------------
##   R-squared                0.227         0.227         0.233         0.319         0.324         0.339         0.340     
##   adj. R-squared           0.226         0.226         0.232         0.318         0.322         0.336         0.337     
##   sigma                    0.710         0.711         0.708         0.667         0.665         0.658         0.658     
##   F                      468.267       234.040       161.879       187.064       152.580       136.047       116.861     
##   p                        0.000         0.000         0.000         0.000         0.000         0.000         0.000     
##   Log-likelihood       -1721.057     -1721.016     -1714.127     -1618.932     -1613.786     -1595.704     -1594.954     
##   Deviance               805.870       805.829       798.915       709.235       704.685       688.926       688.280     
##   AIC                   3448.114      3450.031      3438.254      3249.864      3241.573      3207.408      3207.908     
##   BIC                   3464.245      3471.540      3465.139      3282.127      3279.213      3250.425      3256.302     
##   N                     1599          1599          1599          1599          1599          1599          1599         
## =========================================================================================================================

The variables in this linear model can account for 28% of the variance in the quality of white wine.
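
The comparison above follows a simple recipe: start with a single predictor, add one more at a time, and watch how the fit statistics change. Here is a minimal sketch of that recipe in base R. It uses simulated data rather than the real wine samples, the coefficients used to generate `quality` are arbitrary, and only three of the predictors are included to keep it short.

```r
# Simulated stand-in for the wine data (NOT the real dataset).
set.seed(1)
n <- 500
wine <- data.frame(
  alcohol          = rnorm(n, mean = 10.5, sd = 1.2),
  volatile.acidity = rnorm(n, mean = 0.28, sd = 0.1),
  residual.sugar   = rexp(n, rate = 1 / 6)
)
# Arbitrary generating model; note that quality is continuous here,
# whereas real wine grades are integers.
wine$quality <- 2.5 + 0.3 * wine$alcohol -
  1.5 * wine$volatile.acidity + rnorm(n, sd = 0.7)

# Nested models: each one adds a single predictor to the previous one.
m1 <- lm(quality ~ alcohol, data = wine)
m2 <- update(m1, . ~ . + residual.sugar)
m3 <- update(m2, . ~ . + volatile.acidity)

# Collect the fit statistics side by side.
models <- list(m1 = m1, m2 = m2, m3 = m3)
fit_summary <- data.frame(
  r.squared     = sapply(models, function(m) summary(m)$r.squared),
  adj.r.squared = sapply(models, function(m) summary(m)$adj.r.squared),
  AIC           = sapply(models, AIC)
)
print(round(fit_summary, 3))
```

Plain R-squared can never decrease when a predictor is added to a nested least-squares model, so adjusted R-squared and AIC are the more useful columns for deciding whether an extra feature earns its place.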

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The most prominent correlations we’ve discovered were in fact so strong that, when faceted by wine quality, the features displayed the same trends across all wine grades: for density vs residual sugar, the trend was always upward, for density vs alcohol always downward.

For other, less correlated features (alcohol vs residual sugar, alcohol vs total SO2, density vs total SO2), the trend across the wine grades was also the same, with the exception of the best or worst wines, or both, where the features showed little to no correlation whatsoever.

Were there any interesting or surprising interactions between features?

Since the correlation between density and residual sugar was noticeably stronger than that between density and alcohol (0.839 vs -0.78), I was especially interested to see how residual sugar and alcohol were correlated, and expected at least a slightly positive correlation. To my surprise, the correlation turned out to be clearly negative (-0.451, the second strongest among the negative correlations discovered); in fact, it was strong enough that a downward trend manifested itself across 6 out of 7 wine grades represented in the dataset, the exception being grade 3, where the features showed no correlation at all.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I created a linear model that predicts quality from 7 features of the dataset. Adding further features didn’t yield any significant improvement, so I stopped at this number. Surprisingly enough, the model explains a mere 28% of the variance in the target variable, quality. It seems that wine quality is not well explained by physico-chemical properties alone. Two things to note here: first, the quality of prediction might be improved with more data (right now, there are fewer than 5,000 samples); second, other factors are likely at play, so the model might have benefited from the addition of variables such as the price of the wine, the region and year in which it was produced, and other things not related to wine chemistry. Trying out other models may also lead to better results; I have a hunch that tree-based methods would do well in this case.


Final Plots and Summary

Plot One

Description One

This box plot supports our finding that the strongest positive correlation involving our main variable of interest is quality vs alcohol (0.436). Interestingly, for the lower wine grades we can actually observe a downward trend, which reverses only from grade 5 onwards. Thus, for wines of up to grade 5, the lower the alcohol content, the better a wine tends to be; from grade 5 upwards, wine quality increases with alcohol content.

Moreover, the median (and mean as well) alcohol content of best wines looks significantly different from that of worst wines, which can be used to more or less reliably tell a quality wine from a poor one.
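
Whether the difference “looks significant” on a box plot can be backed up with a quick check. The sketch below compares per-grade medians and runs a Wilcoxon rank-sum test between poor (grade 4 and under) and excellent (grade 8 and over) wines. It is only an illustration on simulated data: the `wine` data frame here is made up, and the choice of the Wilcoxon test is my assumption about a reasonable tool, not something the plot itself relies on.

```r
# Simulated stand-in for the wine data (NOT the real dataset).
set.seed(7)
n <- 800
wine <- data.frame(quality = sample(3:9, n, replace = TRUE))
# Arbitrary V-shaped relationship with a minimum around grade 5.
wine$alcohol <- 9.5 + 0.45 * abs(wine$quality - 5) + rnorm(n, sd = 0.9)

# Median alcohol content for each grade (the box plot's centre lines).
median_by_grade <- tapply(wine$alcohol, wine$quality, median)
print(round(median_by_grade, 2))

# Split off the two extremes using the thresholds from this report.
poor      <- wine$alcohol[wine$quality <= 4]
excellent <- wine$alcohol[wine$quality >= 8]

# Rank-based test: does alcohol content differ between the two groups?
test <- wilcox.test(excellent, poor)
print(test$p.value)
```

A rank-based test is used here because it does not assume the alcohol values are normally distributed within each group.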

Plot Two

Description Two

When plotted unmodified, the residual sugar distribution is highly skewed and has a long right tail. However, when log-transformed, the distribution becomes bimodal. When I later color-coded the plot, I saw that the distribution was in fact bimodal across all the wine grades. Intrigued by this phenomenon, I read a few specialized articles on residual sugar in wines, but couldn’t find an explanation that satisfied me. Therefore I’m inclined to think, for lack of evidence to the contrary, that it’s a regional trait specific to Portuguese wines.
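
The effect of the log transform can be reproduced on toy data. The sketch below draws values from a mixture of two log-normal groups, which is purely my assumption for illustration (the real residual sugar values are not used, and `is_sweet` and `skewness` are hypothetical names), and compares skewness before and after the transform.

```r
# Simulated, right-skewed stand-in for residual sugar (NOT the real data):
# a mixture of a "dry" and a "sweet" group, both log-normally distributed.
set.seed(123)
n <- 1000
is_sweet <- runif(n) < 0.4
residual_sugar <- ifelse(is_sweet,
                         rlnorm(n, meanlog = 2.3, sdlog = 0.4),
                         rlnorm(n, meanlog = 0.5, sdlog = 0.3))

# Simple moment-based sample skewness (0 for a symmetric distribution).
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(residual_sugar)
skew_log <- skewness(log10(residual_sugar))
print(c(raw = skew_raw, log10 = skew_log))

# Uncomment to compare the two histograms visually:
# hist(residual_sugar, breaks = 40)
# hist(log10(residual_sugar), breaks = 40)
```

On the raw scale the long right tail squeezes everything else into the first few bins, so two groups can hide inside what looks like a single peak; on the log scale the tail is compressed and the groups separate.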

Plot Three

Description Three

This faceted scatter plot illustrates the third strongest negative correlation discovered during the analysis: alcohol vs total SO2. Each subplot contains a line of best fit that visually reinforces the trend across wine grades. One interesting observation here is that with the best and worst wines, the features display little to no correlation whatsoever, whereas for wines of grades 4 through 8, a clear downward trend manifests itself. This might indicate that this particular combination of features is a bad candidate for predicting wine quality. Indeed, when I was building a linear model, alcohol turned out to be the best contributor to the overall quality of prediction, whereas total SO2 added nothing to improve it and therefore was not included in the resulting model.


Reflection

The dataset I’ve analyzed contains information on almost 5,000 white wines across 11 variables plus an output variable based on sensory data: a grade on a scale of 0 to 10 given to each wine sample by professional wine judges. This dataset is restricted to Portuguese wines and contains only their physico-chemical properties.

I began my analysis by building histograms of each feature to understand their distributions. Most turned out to be approximately normally distributed, with a few notable exceptions (residual sugar, for example), where I observed heavy skew and long tails. Log-transforming these variables helped me deal with the skew. I also defined thresholds for poor (grade 4 and under) and excellent (grade 8 and over) wines, then subset my dataset using these thresholds and plotted distributions of individual features across poor and excellent wines side by side. This helped me see whether these distributions were very different and identify a few potential candidates that could be useful in telling a low-quality wine from a better one.

I went on to explore pairwise relationships between the features and pick out the most strongly correlated (both positively and negatively) pairs to focus my analysis on them. To my surprise, the main variable of interest - quality - wasn’t involved in any of the strongest correlations identified. I built a few scatter plots and included a line of best fit for each of them to more clearly see the general trend in the data points. Then I added a few box plots that reinforced my earlier findings and offered some new insights.

My greatest success was finding out that alcohol content was the most influential feature that could more or less reliably be used to differentiate between poor and excellent wines. Indeed, when I later built a linear model to predict a wine grade, this feature alone contributed over 70% to the overall prediction quality.

In the final part of my analysis, I used wine grades to color-code and facet a few plots that I’d built previously to see if any variables reinforce each other across any of the wine grades. The main finding here was that in the two most strongly correlated pairs the correlation was so pronounced that the trend stayed the same across all wine grades: it was always upward for density vs residual sugar and downward for alcohol vs density. The situation was a bit different for more weakly correlated pairs: the trend did stay the same across most wine grades, but with the worst or best wines, the features I was analyzing displayed little to no correlation at all (for example, alcohol vs total SO2), which signaled that these combinations were probably not the best predictors of wine quality. I tested these findings when building a linear model and excluded the worst contributors from the final version.

I’ve also bumped into a couple of obstacles along the way. First, I found out that the residual sugar distribution, when log-transformed, is bimodal across all wine grades. I’ve been struggling to explain this phenomenon for some time and even read a few specialized articles on the topic, but found no satisfactory explanation so far. So I’m inclined to believe this phenomenon is specific to Portuguese wines, since that’s what I’ve been analyzing all along.

Another thing I had difficulties with was the linear model that I’d built. It was able to explain only 28% of the variance in wine quality, which I found to be a pretty poor result. At first, I thought I was doing something wrong and actually spent a couple of days trying to engineer new features and combine them in various ways (to no avail), but then I realized that some other factors were at play and physico-chemical properties alone were not enough of a quality predictor.

And this realization leads me to suggestions on how to improve this analysis. First and foremost, more data would be nice. 5,000 wine samples is alright, but given the number of wines in the world, it’s just a drop in the ocean. Besides, the dataset is restricted to only Portuguese wines, which significantly limits its value and ability to represent the whole population. Second, as I mentioned above, there must be some other features that heavily influence wine quality. Better results might have been obtained if we had information about a region where a wine was produced, the year it was produced, grape type, selling price and wine brand, to name a few. Also, it might be a good idea to test other kinds of models and see how they fare against each other. I guess more powerful models, like SVM or tree-based methods, could have demonstrated impressive results.